Summary of Data Set

Before the analysis, I take a look at the structure of the wine data set, which combines the red wine and white wine data sets.
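The combination step can be sketched as follows. This is a minimal sketch on toy data, not the code used for this report: the data frame names `red_wine` and `white_wine` and all values are assumptions made for illustration, and only three columns are shown.

```r
# Toy stand-ins for the red and white wine tables (illustrative values only).
red_wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.28),
  quality          = c(5L, 5L, 6L)
)
white_wine <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23),
  quality          = c(6L, 6L, 6L, 6L)
)

# Tag each table with its color before stacking, so the origin of a row is kept.
red_wine$color   <- "red"
white_wine$color <- "white"

# rbind() requires the same column names in both data frames.
wine_data <- rbind(red_wine, white_wine)
wine_data$color <- factor(wine_data$color, levels = c("red", "white"))

# Inspection calls of the kind that produce the output in this section.
names(wine_data)
str(wine_data)
dim(wine_data)
summary(wine_data)
```

Subsetting with `wine_data[wine_data$color == "red", ]` keeps both factor levels, which is consistent with the `white: 0` row in the red wine summary.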

Overall Information

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "total.acidity"        "quality"              "color"               
## [16] "quality_level"
## 'data.frame':    6497 obs. of  16 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ total.acidity       : num  8.1 8.68 8.56 11.48 8.1 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ color               : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
##  $ quality_level       : Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 1 1 1 3 3 1 ...
## [1] 6497   16
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100  
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.219   Mean   :0.5313  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##     alcohol      total.acidity       quality        color     
##  Min.   : 8.00   Min.   : 4.110   Min.   :3.000   red  :1599  
##  1st Qu.: 9.50   1st Qu.: 6.710   1st Qu.:5.000   white:4898  
##  Median :10.30   Median : 7.300   Median :6.000               
##  Mean   :10.49   Mean   : 7.555   Mean   :5.818               
##  3rd Qu.:11.30   3rd Qu.: 8.050   3rd Qu.:6.000               
##  Max.   :14.90   Max.   :16.285   Max.   :9.000               
##  quality_level
##  Low   :2384  
##  Medium:2836  
##  High  :1277  
##               
##               
## 

Red Wine Information

## [1] 1599   16
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      total.acidity       quality        color     
##  Min.   : 8.40   Min.   : 5.120   Min.   :3.000   red  :1599  
##  1st Qu.: 9.50   1st Qu.: 7.680   1st Qu.:5.000   white:   0  
##  Median :10.20   Median : 8.445   Median :6.000               
##  Mean   :10.42   Mean   : 8.847   Mean   :5.636               
##  3rd Qu.:11.10   3rd Qu.: 9.740   3rd Qu.:6.000               
##  Max.   :14.90   Max.   :16.285   Max.   :8.000               
##  quality_level
##  Low   :744   
##  Medium:638   
##  High  :217   
##               
##               
## 

White Wine Information

## [1] 4898   16
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol      total.acidity       quality        color     
##  Min.   : 8.00   Min.   : 4.110   Min.   :3.000   red  :   0  
##  1st Qu.: 9.50   1st Qu.: 6.570   1st Qu.:5.000   white:4898  
##  Median :10.40   Median : 7.070   Median :6.000               
##  Mean   :10.51   Mean   : 7.133   Mean   :5.878               
##  3rd Qu.:11.40   3rd Qu.: 7.590   3rd Qu.:6.000               
##  Max.   :14.20   Max.   :14.470   Max.   :9.000               
##  quality_level
##  Low   :1640  
##  Medium:2198  
##  High  :1060  
##               
##               
## 

Univariate Plots Section

Quality of Wine

Summary

## wine_data$color: red
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000 
## -------------------------------------------------------- 
## wine_data$color: white
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Standard Deviation

## wine_data$color: red
## [1] 0.8075694
## -------------------------------------------------------- 
## wine_data$color: white
## [1] 0.8856386

As the plots above show, the histograms are overlaid with normal distributions generated from the mean and standard deviation of the variable quality. The histograms of both red wine and white wine fit the normal distributions reasonably well. The mode of red wine is at quality 5 and that of white wine is at quality 6. Besides, there are far more white wine samples than red wine samples (4898 vs. 1599), so it is not surprising that the histogram of white wine fits better.
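The per-color summaries and the normal overlay can be sketched in base R as follows. The data are randomly generated stand-ins, and base graphics are used here only for brevity; the plotting library used for the original figures is not stated in this report.

```r
# Toy data standing in for wine_data (randomly generated, illustrative only).
set.seed(1)
wine_data <- data.frame(
  quality = c(sample(3:8, 200, replace = TRUE, prob = c(1, 3, 40, 40, 12, 4)),
              sample(3:9, 600, replace = TRUE, prob = c(1, 3, 30, 45, 17, 3, 1))),
  color   = factor(rep(c("red", "white"), times = c(200, 600)))
)

# Per-color summary and standard deviation, in the style of the output above.
by(wine_data$quality, wine_data$color, summary)
by(wine_data$quality, wine_data$color, sd)

# Density-scaled histogram of one color with the fitted normal curve on top.
red_quality <- wine_data$quality[wine_data$color == "red"]
hist(red_quality, breaks = seq(2.5, 9.5, by = 1), freq = FALSE,
     main = "Red wine quality", xlab = "quality")
curve(dnorm(x, mean = mean(red_quality), sd = sd(red_quality)),
      add = TRUE, lwd = 2)
```

Setting `freq = FALSE` puts the histogram on the density scale, which is what makes it comparable with `dnorm()`.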


Fixed Acidity of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.400   7.000   7.215   7.700  15.900

The overall distributions of the wines are similar to a normal distribution. The distribution of white wine is hard to observe, but it is more concentrated. The distribution of red wine, on the other hand, is wider and looks right-skewed. According to the summary results, the range of red wine, 4.6 to 15.9 g/dm³, is wider than that of white wine, 3.8 to 14.2 g/dm³. Besides, every summary statistic of red wine is higher than the corresponding one of white wine.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  4.6      7.1     7.9     8.32    9.2     15.9    1.741 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  5.2      7.1     7.7     8.048   8.9     11.8    1.343 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  3.8      6.3     6.8     6.855   7.3     14.2    0.844 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  4.8      6.3     6.8     6.81    7.3     8.8     0.727

As the plots above show, the distribution of red wine does not change much before and after outlier handling. According to the statistics, most values change only slightly, although the distribution becomes a little more concentrated, as the standard deviation shows (1.741 vs. 1.343). Additionally, neither histogram of red wine fits the normal distribution well, because both look right-skewed.

On the other hand, the original data of white wine has more outliers: the range of the x-axis changes considerably, from 4 to 14 g/dm³ before handling to 4 to 9 g/dm³ after. According to the statistics, most values do not change much except the maximum. Besides, both histograms of white wine fit the normal distribution well.
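The before/after comparison can be sketched as follows. The outlier rule used for this report is not stated, so the 1.5 × IQR fence below is an assumption, and `remove_outliers()` and `describe()` are hypothetical helper names; the input values are toy data.

```r
# Hypothetical helper: keep values within [Q1 - k*IQR, Q3 + k*IQR].
# The 1.5 * IQR rule is an assumption, not necessarily the rule used above.
remove_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  x[!is.na(x) & x >= q[1] - fence & x <= q[2] + fence]
}

# Hypothetical helper: the seven statistics reported in the tables above.
describe <- function(x) {
  round(c(summary(x), SD = sd(x)), 3)
}

# Toy fixed-acidity values with two extreme readings (illustrative only).
fixed_acidity <- c(6.3, 6.5, 6.8, 6.8, 7.0, 7.1, 7.3, 7.4, 13.5, 14.2)

describe(fixed_acidity)                   # with outliers
describe(remove_outliers(fixed_acidity))  # without outliers
```

With a rule of this kind the quartiles barely move while the maximum and the standard deviation drop, which is the pattern seen in the tables.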


Volatile Acidity of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2300  0.2900  0.3397  0.4000  1.5800

The overall distribution of the wine looks right-skewed. The overall range is 0.08 to 1.58 g/dm³. The distribution of white wine is more concentrated than that of red wine.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.12     0.39    0.52    0.528   0.64    1.58    0.179 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.12     0.39    0.52    0.522   0.635   0.98    0.165 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.08     0.21    0.26    0.278   0.32    1.1     0.101 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.08     0.21    0.26    0.264   0.31    0.48    0.076

The histograms of red wine and white wine fit the normal distribution well both before and after outlier handling, and both wines have large outliers on the right side. After outlier handling, both distributions are more concentrated, especially that of white wine. It also becomes clearer that the distribution of red wine has two peaks. According to the statistics, most values change too slightly to be notable, but the maximum values change considerably, just as for fixed acidity.


Citric Acid of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2500  0.3100  0.3186  0.3900  1.6600

The distribution of white wine can be regarded as an edge peak distribution: it looks like a normal distribution except that it has a large peak at one tail. The distribution of red wine, on the other hand, is more like a plateau distribution: there are many peaks close together, so the top of the distribution resembles a plateau. According to the summary results, the range of white wine, 0.00 to 1.66 g/dm³, is wider than that of red wine, 0.00 to 1.00 g/dm³.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0    0.09    0.26    0.271   0.42    1   0.195 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0    0.08    0.23    0.238   0.38    0.69    0.175 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0    0.27    0.32    0.334   0.39    1.66    0.121 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.12     0.27    0.31    0.321   0.37    0.52    0.081

Outlier handling shows that the data of both red wine and white wine contain outliers with large values, especially white wine. It is also clearer that the distribution of red wine is a plateau distribution. The distribution of white wine, on the other hand, fits the normal distribution well, but there is a clear peak at about 0.5 g/dm³.


Residual Sugar of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800

The distributions of both wines are right-skewed, with the counts concentrated in the range 1.5 to 3.0 g/dm³. According to the summary results, the range of white wine, 0.60 to 65.80 g/dm³, is wider than that of red wine, 0.90 to 15.50 g/dm³. Besides, most white wines clearly have more residual sugar than red wines do.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.9      1.9     2.2     2.539   2.6     15.5    1.41 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  1.2      1.9     2.1     2.128   2.4     3.1     0.375 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.6      1.7     5.2     6.391   9.9     65.8    5.072 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.6      1.8     5.2     6.36    9.6     22      4.908

In the plots, the distribution of red wine after outlier handling becomes very concentrated (from 1.2 to 3.1 g/dm³) and fits the normal distribution very well. The data of white wine, however, has outliers with very large values. After outlier handling, its range narrows considerably, but the distribution still fits the normal distribution very poorly and remains quite wide (from 0.6 to 22 g/dm³).


Chlorides of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100

The distributions of both wines are similar to a normal distribution. However, both have a long tail on the right side, which I think can be regarded as outliers. For white wine, most data are concentrated in the range 0.02 to 0.12 g/dm³; for red wine, most data are concentrated in the range 0.06 to 0.16 g/dm³.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.012    0.07    0.079   0.087   0.09    0.611   0.047 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.045    0.069   0.078   0.078   0.086   0.112   0.013 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.009    0.036   0.043   0.046   0.05    0.346   0.022 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.015    0.035   0.042   0.042   0.049   0.07    0.01

After the outliers handling, the distributions of both red wine and white wine fit the normal distribution very well.


Free Sulfur Dioxide of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   29.00   30.53   41.00  289.00

The distribution of white wine is similar to a normal distribution with a long tail on the right side. The distribution of red wine, on the other hand, is similar to a right-skewed distribution. For white wine, most data are concentrated in the range 0 to 80 mg/dm³; for red wine, most data are concentrated in the range 0 to 50 mg/dm³.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  1    7   14      15.87   21      72      10.46 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  1    8   13      15.08   20      42      8.885 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  2    23      34      35.31   46      289     17.007 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  2    24      34      34.59   45      78      14.838

Both red wine and white wine have outliers with large values. After outlier handling, it is clearer that the distribution of red wine is right-skewed, while the distribution of white wine fits the normal distribution very well.


Total Sulfur Dioxide of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    77.0   118.0   115.7   156.0   440.0

The distribution of red wine is similar to a right-skewed distribution with outliers. The distribution of white wine, on the other hand, is bimodal with outliers. According to the summary data, every summary statistic of white wine is higher than the corresponding one of red wine.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  6    22      38      46.47   62      289     32.895 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  6    22      35      41      54      111     24.496 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  9    108     134     138.4   167     440     42.498 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  21   107     132     136.8   166     255     41.045

In these plots, red wine has outliers with very large values. After outlier handling, the range shrinks to roughly one third of the original (6 to 111 mg/dm³ instead of 6 to 289 mg/dm³), but the shape of the distribution does not change much. The outliers of white wine, on the other hand, are not as extreme, judging by the new range of the distribution (from 20 to 260 mg/dm³). Similarly, the distribution of white wine does not change before and after outlier handling, and both versions fit the normal distribution well.


Density of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9923  0.9949  0.9947  0.9970  1.0390

The distribution of white wine can be regarded as bimodal, and that of red wine is similar to a normal distribution. Most counts of both wines are around 0.995 g/cm³.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.99     0.996   0.997   0.997   0.998   1.004   0.002 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.992    0.996   0.996   0.996   0.997   1.001   0.002 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.987    0.992   0.994   0.994   0.996   1.039   0.003 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.987    0.992   0.994   0.994   0.996   1.002   0.003

In this plot, the data of white wine has outliers with relatively large values, whereas the range of red wine does not change much. After outlier handling, the distribution of red wine still fits the normal distribution well, but the processed distribution looks slightly right-skewed.


pH of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.110   3.210   3.219   3.320   4.010

The distributions of both wines are similar to a normal distribution, and most data of both wines are concentrated in the range 2.7 to 3.7. However, there are still several outliers above 4.0.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  2.74     3.21    3.31    3.311   3.4     4.01    0.154 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  2.94     3.24    3.33    3.331   3.41    3.68    0.129 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  2.72     3.09    3.18    3.188   3.28    3.82    0.151 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  2.82     3.1     3.18    3.188   3.28    3.55    0.137

The distributions of both red wine and white wine hardly change before and after outlier handling. They still fit the normal distribution very well.


Sulphates of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4300  0.5100  0.5313  0.6000  2.0000

The distributions of both wines are similar to a normal distribution. However, the distribution of red wine has a long tail on the right side, which is regarded as outliers. According to the summary data, every summary statistic of red wine is higher than the corresponding one of white wine.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.33     0.55    0.62    0.658   0.73    2   0.17 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.33     0.55    0.61    0.629   0.7     0.95    0.113 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.22     0.41    0.47    0.49    0.55    1.08    0.114 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  0.22     0.41    0.47    0.478   0.54    0.73    0.094

The data of red wine has outliers with larger values than those of white wine. After outlier handling, the plot shows that the distribution of red wine looks right-skewed. The distribution of white wine does not change much; only some outliers were removed.


Alcohol of Wine

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90

The distributions of red wine and white wine are similar. In my opinion, they resemble a right-skewed distribution with noise, especially the distribution of white wine. The highest counts of both red wine and white wine are at 9.5%.

Outliers Handling

Statistic Information

## Red Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  8.4      9.5     10.2    10.42   11.1    14.9    1.066 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  8.7      9.5     10.1    10.36   11      13.1    0.965 
##  ==================================================================
##  White Wine:
##  --------------------------------------------------------
##  with outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  8    9.5     10.4    10.51   11.4    14.2    1.231 
##  --------------------------------------------------------
##  without outliers
##  Min.    1st Qu. Median  Mean    3rd Qu. Max.     SD
##  8.4      9.5     10.5    10.6    11.4    14.2    1.214

After outlier handling, neither the shapes of the distributions nor their ranges change much. The distributions still do not fit the normal distribution well, and both have many peaks.


Univariate Analysis

What is the structure of your dataset?

There are 1599 records of red wine and 4898 records of white wine in the dataset, with 13 original variables (including the row index X) and 3 variables that I added: color, quality_level and total.acidity.

About the features, there is one categorical variable, quality, and the others are numerical variables that indicate physical and chemical properties of the wine.

According to the plots, most features of both red wine and white wine follow a roughly normal or skewed distribution. The exceptions are white wine's total sulfur dioxide and density, which are bimodal, and red wine's citric acid, which has a plateau distribution.

Most distributions of red wine are similar to that of white wine except the ones of citric acid and total sulfur dioxide.

Although most distributions of red wine and white wine are similar, the values are quite different. Therefore, it is interesting to ask whether the features that affect quality differ between red wine and white wine.


What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality. I would like to determine which features are suitable for predicting the quality of the wine, and whether the conclusions for red wine and white wine are similar.


What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

According to the document, wine quality information, I think that the main features should be alcohol, free sulfur dioxide and acidity, especially citric acid.


Did you create any new variables from existing variables in the dataset?

Yes, I created the variable quality_level to generate better plots. It has three values, low, medium and high, whose boundaries are the 1/3 and 2/3 quantiles of quality in the current data set. Besides, I also created the variable total acidity, which is the sum of fixed acidity and volatile acidity.
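The two derived variables can be sketched as follows. This is a minimal sketch on toy data; the use of `quantile()` with `cut()` is an assumption about how the binning was implemented, chosen because it matches the description above.

```r
# Toy data (illustrative values only).
wine_data <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50, 0.58, 0.50),
  quality          = c(3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8)
)

# Boundaries at the 1/3 and 2/3 quantiles of quality.
cuts <- quantile(wine_data$quality, probs = c(1/3, 2/3))

# cut() uses right-closed intervals by default: (-Inf, q1], (q1, q2], (q2, Inf).
wine_data$quality_level <- cut(
  wine_data$quality,
  breaks = c(-Inf, cuts, Inf),
  labels = c("Low", "Medium", "High")
)

# Total acidity as the sum of fixed and volatile acidity.
wine_data$total.acidity <- wine_data$fixed.acidity + wine_data$volatile.acidity

table(wine_data$quality_level)
```

Note that `cut()` stops with an error when two quantiles coincide, so the breaks should be checked for uniqueness on heavily tied data.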

For the outlier handling analysis, I created three more data frames, wine_data_no_color, wine_data_no_quality_level and wine_data_no_quality, to store the data after outlier handling. These data frames are used in the bivariate and multivariate plots/analysis sections. They differ in how the data are grouped during outlier handling: wine_data_no_color is grouped by the wine's color, wine_data_no_quality_level by quality level and color, and wine_data_no_quality by quality and color. From the beginning, I expected red wine and white wine to be quite different. Therefore, almost all data handling was carried out separately for each color.
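Group-wise outlier handling of this kind can be sketched as follows. The 1.5 × IQR rule and the helper name `within_fences()` are assumptions, since the rule used in this report is not stated, and the data are toy values.

```r
# Hypothetical helper: TRUE for values inside the k * IQR fences.
# The 1.5 * IQR rule is an assumption, not necessarily the rule used here.
within_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  !is.na(x) & x >= q[1] - fence & x <= q[2] + fence
}

# Toy data (illustrative values only).
wine_data <- data.frame(
  color   = factor(rep(c("red", "white"), each = 6)),
  alcohol = c(9.4, 9.8, 9.8, 10.0, 10.5, 14.9,
              8.8, 9.5, 10.1, 10.4, 11.0, 11.4)
)

# ave() applies the function within each group and returns results in row order.
keep <- as.logical(ave(wine_data$alcohol, wine_data$color, FUN = within_fences))
wine_data_no_color <- wine_data[keep, ]

table(wine_data_no_color$color)
```

Passing further grouping vectors to `ave()`, for example a quality level column, computes the fences within each combination of groups instead.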


Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

From my observation, most distributions are similar to a normal distribution; for these features, most values are close to the mean and median. However, two features, residual sugar and total sulfur dioxide, show different distributions. In the plot of residual sugar, the distributions of both red wine and white wine look like the right half of a normal distribution. In the plot of total sulfur dioxide, the distribution of white wine shows two peaks, at values 20-30 and 110-120.

Additionally, I performed the outlier handling analysis. In this section, I used only wine_data_no_color to compare the original data with the data after outlier handling. I was not too surprised that the distribution of each feature does not change much before and after outlier handling. Still, outlier handling helps me observe the distribution of each feature and the difference between red wine and white wine.



Bivariate Plots Section

The above plots show the correlations between the features in the data set, separated by wine color. A darker color means a stronger correlation. Red indicates a positive correlation, whereas blue indicates a negative one. The numbers in the boxes are the correlation coefficients.

According to the above plots, the correlation coefficients between pairs of features clearly differ between red wine and white wine.

First, the feature of interest in this report is quality. Its correlations with most other features are similar in red wine and white wine, especially for alcohol: 0.5 in red wine and 0.4 in white wine. On the other hand, fixed acidity has a positive correlation with quality, 0.1, in red wine, but a negative one, -0.1, in white wine.

Second, I picked several high correlations to plot and classified them into two sections: “The Features Highly Correlated to Density of Wine” and “The Features Highly Correlated to Acidity of Wine”. I also picked the correlation between total sulfur dioxide and free sulfur dioxide, because free sulfur dioxide is part of total sulfur dioxide, so the plot of total sulfur dioxide vs free sulfur dioxide should show a good correlation.



The Features Highly Correlated to Density of Wine

Density vs Alcohol

Correlation Coefficient Information

##   color       corr
## 1   red -0.4961798
## 2 white -0.7801376

According to the plot, both the fitted line and the distribution of the scatter points show that density tends to increase as alcohol content decreases, in both red wine and white wine. The correlation is much stronger in white wine, -0.78, than in red wine, -0.50.
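A per-color coefficient table like the one above can be produced with a few lines of base R. The sketch below uses made-up values and assumes Pearson correlation, which is the default of cor(); the report does not say which method it used.

```r
# Toy data (made up) with the same column names as the wine data set.
wine <- data.frame(
  color   = rep(c("red", "white"), each = 4),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 1.0010, 0.9940, 0.9951, 0.9956),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 8.8, 9.5, 10.1, 9.9)
)

# Split by color, compute one coefficient per group, and stack the results.
corr_by_color <- do.call(rbind, lapply(split(wine, wine$color), function(g) {
  data.frame(color = g$color[1], corr = cor(g$density, g$alcohol))
}))
rownames(corr_by_color) <- NULL

print(corr_by_color)
```

The same pattern applies to any pair of features: only the two column names inside cor() change.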

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      -0.575 
##  white    -0.822 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      -0.526 
##  white    -0.814

With outlier handling, the correlation between density and alcohol becomes stronger in both data sets. The scatter of points in both becomes narrower and steeper, and the distributions of both wines look more oval.


Density vs Fixed Acidity

Correlation Coefficient Information

##   color      corr
## 1   red 0.6680473
## 2 white 0.2653310

Density clearly tends to increase with fixed acidity in red wine, whereas the correlation is weak in white wine. As the following subsection shows, outlier handling does not change this overall result.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      0.56 
##  white    0.224 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      0.555 
##  white    0.217

With outlier handling, the correlation between density and fixed acidity decreases in both data sets, from 0.67 to about 0.56 in red wine and from 0.27 to about 0.22 in white wine, but the overall picture does not change much. The scatter of points becomes narrower and steeper for red wine, while it becomes rounder for white wine.


Density vs Residual Sugar

Correlation Coefficient Information

##   color      corr
## 1   red 0.3552834
## 2 white 0.8389665

Density clearly tends to increase with residual sugar in white wine, whereas the correlation is weak in red wine. In addition, most white wine values are higher than those of red wine.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      0.353 
##  white    0.843 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      0.426 
##  white    0.837

After outlier handling, the correlations between density and residual sugar in both wines are almost the same as in the original data set; the only visible shift is for red wine grouped by quality level and wine color (0.43 vs 0.36). After removing the outliers, the distribution for red wine becomes narrower, and the one for white wine looks like a triangle.


Density vs Total Sulfur Dioxide

Correlation Coefficient Information

##   color        corr
## 1   red -0.02194583
## 2 white  0.29421041

I picked this correlation because it is positive in white wine but slightly negative (essentially zero) in red wine. Although the correlation coefficient for white wine, 0.29, is not very high, the scatter plot shows that density tends to increase with total sulfur dioxide. In addition, most white wine values are higher than those of red wine.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      0.154 
##  white    0.553 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      0.195 
##  white    0.537

After outlier handling, the correlation increases dramatically in both data sets. For red wine it goes from no relation to a weak relation, and for white wine from a weak relation to a strong one. This means two things. First, the outliers do affect the relation between density and total sulfur dioxide. Second, the relation is much stronger than the original data suggested. The scatter of points in both distributions also becomes rounder.
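The sensitivity of the correlation coefficient to outliers can be seen in a tiny synthetic example. The numbers below are made up and have nothing to do with the wine data; they only show that a single extreme point can hide an otherwise strong linear relation.

```r
# Ten points that follow a clear increasing trend.
x <- 1:10
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
r_clean <- cor(x, y)

# The same points plus one extreme observation far off the trend.
x_all <- c(x, 30)
y_all <- c(y, -20)
r_all <- cor(x_all, y_all)

print(c(with_outlier = r_all, without_outlier = r_clean))
```

Here the coefficient without the extreme point is strongly positive, while the coefficient with it is not, which is the same kind of effect as the change seen for density vs total sulfur dioxide.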


The Features Highly Correlated to Acidity of Wine

Citric Acid vs pH

Correlation Coefficient Information

##   color       corr
## 1   red -0.5419041
## 2 white -0.1637482

In this plot, both the fitted line, with a correlation coefficient of -0.54, and the distribution of the scatter points show that pH tends to decrease with increasing citric acid in red wine. In white wine, on the other hand, the scatter points show no relation between pH and citric acid, and the correlation coefficient, -0.16, is weak.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      -0.441 
##  white    -0.081 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      -0.457 
##  white    -0.141

The distributions of the scatter points of both red wine and white wine become much clearer, which implies that there were many outliers in the data for both wines. The correlation weakens a little in both data sets, but in my opinion the overall tendency is the same. The distribution for white wine becomes oval, and the one for red wine becomes slightly narrower.


Volatile Acidity vs pH

Correlation Coefficient Information

##   color        corr
## 1   red  0.23493729
## 2 white -0.03191537

I picked this correlation because it is positive in red wine but slightly negative in white wine. The correlation coefficient in red wine, 0.23, is interesting because I expected acidity to lower pH rather than raise it. The scatter points of red wine suggest no relation between volatile acidity and pH, yet the correlation coefficient is positive. For white wine, the correlation coefficient, -0.032, is very small, and the scatter points also show no relation between volatile acidity and pH. In addition, the white wine data are concentrated in the range 0.1 to 0.4 g/dm³, while the red wine data are much more spread out.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      0.22 
##  white    -0.033 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      0.197 
##  white    -0.065

Similarly, the distributions of the scatter points become much clearer, which means that there were many outliers in the data. However, the correlations in both data sets barely change after outlier handling. The distribution for red wine becomes round, and the one for white wine becomes oval.


Fixed Acidity vs pH

Correlation Coefficient Information

##   color       corr
## 1   red -0.6829782
## 2 white -0.4258583

According to the plot, the scatter points of red wine show that pH tends to decrease with increasing fixed acidity, whereas those of white wine do not show such a clear trend. The white wine data are also more concentrated than the red wine data. As for the correlation coefficients, -0.68 in red wine can be considered a strong relation, while -0.43 in white wine is a medium relation in my opinion.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      -0.673 
##  white    -0.377 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      -0.671 
##  white    -0.373

After outlier handling, many outliers have been removed, but according to the statistics the correlation barely changes. The distribution of the scatter points of white wine becomes oval, and the one for red wine looks much narrower and steeper.


Volatile Acidity vs Citric Acid

Correlation Coefficient Information

##   color       corr
## 1   red -0.5524957
## 2 white -0.1494718

According to Wikipedia, volatile acidity is usually regarded as acetic acid. Both the correlation coefficient, -0.55, and the scatter points of red wine show that volatile acidity tends to decrease with increasing citric acid, although the red wine scatter points are quite spread out. The white wine data, on the other hand, are concentrated in the range 0.1 to 0.4 g/dm³, and the correlation coefficient, -0.15, indicates a weak relation between volatile acidity and citric acid in white wine.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      -0.648 
##  white    -0.128 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      -0.629 
##  white    -0.105

After outlier handling, the correlation in red wine becomes even stronger, although it was already strong in the original data set. This means that the outliers affect the correlation between volatile acidity and citric acid considerably. The correlation in white wine, on the other hand, does not change much, but the data become much cleaner. The distribution of the scatter points becomes oval for white wine and narrower and steeper for red wine.


Fixed Acidity vs Citric Acid

Correlation Coefficient Information

##   color      corr
## 1   red 0.6717034
## 2 white 0.2891807

In this plot, the correlation coefficient for red wine, 0.67, indicates a strong relation between fixed acidity and citric acid, although the red wine scatter points look spread out. The correlation coefficient for white wine, 0.29, is somewhat weak, and the white wine data are concentrated in the range 5 to 9 g/dm³.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      0.641 
##  white    0.257 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      0.643 
##  white    0.269

After outlier handling, we can see that many outliers have been removed. However, the outliers do not seem to affect the correlation much, because the correlation coefficients are almost the same before and after outlier handling.


Others

Residual Sugar vs Alcohol

Correlation Coefficient Information

##   color        corr
## 1   red  0.04207544
## 2 white -0.45063122

As we can see, the scatter points of red wine are concentrated on the left side. Whatever the alcohol content is, the residual sugar changes only slightly, which is consistent with a correlation coefficient of almost 0. The scatter of white wine, on the other hand, is much wider and shaped like a triangle, and alcohol tends to decrease with increasing residual sugar. Although the correlation is not very strong, the relationship between alcohol and residual sugar in white wine is clearly negative.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      0.122 
##  white    -0.49 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      0.066 
##  white    -0.467

In these plots, the distribution of red wine becomes very narrow, which means that there were many outliers in the data and that there is truly no relation between alcohol and residual sugar in red wine, even though the regression line looks quite steep. The distribution of the scatter points of white wine, however, does not change, and only a few outliers have been removed. This suggests that alcohol content and residual sugar are indeed related in white wine.


Total Sulfur Dioxide vs Free Sulfur Dioxide

Correlation Coefficient Information

##   color      corr
## 1   red 0.6676665
## 2 white 0.6155010

I picked this correlation because the correlation coefficients of both red wine and white wine, 0.67 and 0.62 respectively, are quite strong. As I expected, the scatter points of both wines show that total sulfur dioxide tends to increase with increasing free sulfur dioxide, although the data of both wines spread out toward the upper right. Most white wine values are also clearly greater than those of red wine.

Outliers Handling

Correlation Coefficient Information

## Grouped by Wine's Color
##  color    corr
##  -------------------------------
##  red      0.629 
##  white    0.619 
##  ===============================
##  Grouped by Quality Level and Wine's Color
##  color    corr
##  -------------------------------
##  red      0.639 
##  white    0.624

In this plot, the correlation was already quite strong, and the distribution of the original data shows the same tendency. Outlier handling only cleans up the data so that the distributions look clearer. The distribution of white wine looks like an oval, while that of red wine looks like a cone.


The Summary of the relationships I Picked Up

                                                    Red     (±)   /   White   (±)
------------------------------------------------------------------------------------
Density               vs    Alcohol                 strong  (-)   /   strong  (-)
Density               vs    Fixed Acidity           strong  (+)   /   weak    (+)
Density               vs    Residual Sugar          weak    (+)   /   strong  (+)

=================================================================================
Density               vs    Total Sulfur Dioxide    no      ( )   /   weak    (+) <- without outliers handling
Density               vs    Total Sulfur Dioxide    weak    (+)   /   strong  (+) <- with outliers handling
=================================================================================

Citric Acid           vs    pH                      strong  (-)   /   weak    (-)
Volatile Acidity      vs    pH                      weak    (+)   /   no      ( )
Fixed Acidity         vs    pH                      strong  (-)   /   medium  (-)
Volatile Acidity      vs    Citric Acid             strong  (-)   /   weak    (-)
Fixed Acidity         vs    Citric Acid             strong  (+)   /   weak    (+)
Residual Sugar        vs    Alcohol                 no      ( )   /   medium  (-)
Total Sulfur Dioxide  vs    Free Sulfur Dioxide     strong  (+)   /   strong  (+)

After outlier handling, most correlations do not change. The exception is density vs total sulfur dioxide, whose correlation becomes much stronger, starting from a weak relation or even no relation.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The feature of interest in this report is quality. According to the correlation plot, the relation between quality and alcohol is strong in both red wine and white wine, while the relations between quality and the other features are weak in both. Interestingly, acidity, especially fixed acidity, behaves differently in red wine and white wine.


Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I found that acidity and residual sugar influence density differently in red wine and white wine. According to the analysis, acidity, especially fixed acidity, has the greater influence on density in red wine, whereas residual sugar has the greater influence on density in white wine.

Quality and alcohol have the strongest relation, and density and alcohol also have a very strong relation. Given this, acidity might be a secondary factor to consider for the quality of red wine, and residual sugar a secondary factor for the quality of white wine.

Furthermore, the analysis of volatile acidity shows an interesting result. Acidity is usually thought to lower pH, yet in red wine higher volatile acidity is associated with higher pH.

After outlier handling, most correlations between features do not change, or change only a little, and the overall tendencies are the same as in the original data. The distributions of the scatter points simply become much clearer. With respect to grouping, there is almost no difference between the data set grouped by color and the one grouped by quality level (including color). However, I found one interesting case: the correlation between density and total sulfur dioxide becomes much stronger, starting from a weak relation or even no relation. Among the many feature combinations, this is the one where outliers clearly affect the relation. With the clean data, the relation between density and total sulfur dioxide is much stronger than the original data suggested.


What was the strongest relationship you found?

Before outliers handling:

In red wine, the strongest relationship is between pH and fixed acidity, with a correlation coefficient of -0.683. In white wine, the strongest relationship is between density and residual sugar, with a correlation coefficient of 0.839.

After outliers handling:

In red wine, the strongest relationship is between pH and fixed acidity, with a correlation coefficient of -0.673. In white wine, the strongest relationship is between density and residual sugar, with a correlation coefficient of 0.843.
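One way to locate the strongest pair programmatically is to scan the correlation matrix for the largest absolute off-diagonal value. The sketch below is only an illustration on made-up numbers; the data frame red and its values are hypothetical, and Pearson correlation (the default of cor()) is assumed.

```r
# Toy data (made up) with a few of the wine features.
red <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 7.3, 6.7),
  pH            = c(3.51, 3.20, 3.16, 3.30, 3.39, 3.58),
  density       = c(0.9978, 0.9968, 0.9980, 0.9964, 0.9946, 0.9959),
  alcohol       = c(9.4, 9.8, 9.8, 9.4, 10.0, 9.2)
)

corr <- cor(red)
# Blank out the diagonal and the lower triangle so each pair appears once.
corr[lower.tri(corr, diag = TRUE)] <- NA

# Position of the largest absolute coefficient among the remaining pairs.
idx <- which(abs(corr) == max(abs(corr), na.rm = TRUE), arr.ind = TRUE)
strongest <- data.frame(
  feature_1 = rownames(corr)[idx[1, "row"]],
  feature_2 = colnames(corr)[idx[1, "col"]],
  corr      = corr[idx[1, "row"], idx[1, "col"]]
)

print(strongest)
```

Running this once per wine color, before and after outlier handling, would give the four answers listed above.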



Multivariate Plots Section

Exploring Wine Parameters with Quality Level

Density vs Alcohol

Correlation Coefficient Information

##   color quality_level       corr
## 1   red           Low -0.3104089
## 2   red        Medium -0.5471226
## 3   red          High -0.5841169
## 4 white           Low -0.6801808
## 5 white        Medium -0.7435118
## 6 white          High -0.8436137

In this plot, the trends of the relationship between density and alcohol are similar in red wine and white wine. Basically, the higher the quality level, the stronger the relationship, especially for white wine at the high quality level, where the correlation coefficient is -0.84. The overall trend also indicates that the relationships are stronger in white wine than in red wine.
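A coefficient table broken down by both color and quality level can be sketched by splitting on two factors at once. The data below are synthetic and generated by a formula, so every group of the same color gets the same coefficient; only the structure of the computation is meant to carry over. Pearson correlation is assumed.

```r
# Synthetic data: 4 observations for each color / quality level combination.
wine <- expand.grid(
  rep           = 1:4,
  quality_level = c("Low", "Medium", "High"),
  color         = c("red", "white"),
  stringsAsFactors = FALSE
)
wine$quality_level <- factor(wine$quality_level,
                             levels = c("Low", "Medium", "High"))
wine$alcohol <- 9 + 0.5 * wine$rep
# Density falls as alcohol rises; white wine gets a small extra wobble.
wine$density <- 1 - 0.001 * wine$rep +
  0.0005 * (wine$rep %% 2) * (wine$color == "white")

# Quality level varies fastest, so rows come out as red Low/Medium/High,
# then white Low/Medium/High.
groups <- split(wine, list(wine$quality_level, wine$color), drop = TRUE)
corr_table <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    color         = g$color[1],
    quality_level = g$quality_level[1],
    corr          = cor(g$density, g$alcohol)
  )
}))
rownames(corr_table) <- NULL

print(corr_table)
```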

Outliers Handling

Correlation Coefficient Information

## ===============================================
##  Grouped by Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             -0.454 
##  Red     |    Medium          -0.586 
##          |    High            -0.639 
##  ----------------------------------------------
##          |    Low             -0.706 
##  White   |    Medium          -0.817 
##          |    High            -0.857 
##  ==============================================
##  Grouped by Quality Level and Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             -0.305 
##  Red     |    Medium          -0.572 
##          |    High            -0.621 
##  ----------------------------------------------
##          |    Low             -0.675 
##  White   |    Medium          -0.814 
##          |    High            -0.747 
##  ===============================================

After outlier handling, the distributions of the scatter points of the two data sets (the two grouping methods) look similar. However, the correlation coefficients reveal different tendencies for white wine. In the data grouped by wine color, the higher the quality level, the stronger the correlation. In the data grouped by quality level and wine color, on the other hand, the correlation at the medium quality level (-0.814) is stronger than at the high quality level (-0.747).


Density vs Residual Sugar

Correlation Coefficient Information

##   color quality_level      corr
## 1   red           Low 0.4051275
## 2   red        Medium 0.3452290
## 3   red          High 0.3498892
## 4 white           Low 0.8796645
## 5 white        Medium 0.8556329
## 6 white          High 0.8202080

Residual sugar is strongly related to density in white wine, while the relationship is relatively weak in red wine. In addition, the scatter points of red wine are concentrated in the range 1.25 to 2.5 g/dm³, whereas those of white wine span the range 1.25 to 20 g/dm³. In the plot of white wine, the correlation coefficient decreases slightly with increasing quality level.

Outliers Handling

Correlation Coefficient Information

## ===============================================
##  Grouped by Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.473 
##  Red     |    Medium          0.269 
##          |    High            0.351 
##  ----------------------------------------------
##          |    Low             0.899 
##  White   |    Medium          0.846 
##          |    High            0.832 
##  ==============================================
##  Grouped by Quality Level and Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.596 
##  Red     |    Medium          0.315 
##          |    High            0.367 
##  ----------------------------------------------
##          |    Low             0.915 
##  White   |    Medium          0.846 
##          |    High            0.644 
##  ===============================================

After outlier handling, the distributions of the scatter points of red wine become very small in both data sets, while those of white wine do not change much. In terms of the correlation coefficients, the results for red wine do not change much. For white wine, the results of the data grouped by wine color are almost the same as those of the original data. In the data grouped by quality level and wine color, however, the correlation coefficient decreases dramatically with increasing quality level, which shows that the relation between residual sugar and density at the high quality level is not as strong as I thought.


Residual Sugar vs Alcohol

Correlation Coefficient Information

##   color quality_level        corr
## 1   red           Low  0.09424935
## 2   red        Medium -0.02603628
## 3   red          High  0.07175752
## 4 white           Low -0.43353934
## 5 white        Medium -0.45499608
## 6 white          High -0.48392064

As we know, alcohol is produced by the fermentation of sugar (Wikipedia). The earlier sections also examined the relationship between density and alcohol and the one between density and residual sugar, and showed that density is strongly related to residual sugar in white wine. Therefore, I would like to look at the relationship between alcohol and residual sugar. The relationships are quite weak in red wine, where the correlation coefficients at all quality levels are almost 0. In white wine, on the other hand, the relationships are medium: -0.43, -0.45 and -0.48 at the low, medium and high quality levels respectively. In my opinion, this might imply that the fermentation of red wine is almost complete, so the residual sugar content affects quality only slightly, whereas the degree of fermentation of white wine seems to affect quality much more.

Outliers Handling

Correlation Coefficient Information

## ===============================================
##  Grouped by Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.14 
##  Red     |    Medium          0.11 
##          |    High            0.201 
##  ----------------------------------------------
##          |    Low             -0.467 
##  White   |    Medium          -0.501 
##          |    High            -0.52 
##  ==============================================
##  Grouped by Quality Level and Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.029 
##  Red     |    Medium          0.102 
##          |    High            0.199 
##  ----------------------------------------------
##          |    Low             -0.47 
##  White   |    Medium          -0.497 
##          |    High            -0.111 
##  ===============================================

After outlier handling, the distribution of the scatter points of red wine becomes very narrow, and the correlation coefficients remain small. In other words, there is no clear relation between alcohol and residual sugar in red wine. For white wine, on the other hand, the results of the two grouping methods are quite different. The data grouped by wine color show that outlier handling strengthens the relation between alcohol and residual sugar in white wine, and that the correlation gets stronger with increasing quality level. The data grouped by quality level and wine color show the opposite: the relation strengthens from the low to the medium quality level, and then weakens dramatically from the medium to the high quality level. This indicates a weak relation, or even none, between alcohol and residual sugar in white wine at the high quality level. It also implies that the fermentation is almost complete for white wine at the high quality level.


Density vs Fixed Acidity

Correlation Coefficient Information

##   color quality_level      corr
## 1   red           Low 0.6892823
## 2   red        Medium 0.7004321
## 3   red          High 0.7817219
## 4 white           Low 0.1706753
## 5 white        Medium 0.2223461
## 6 white          High 0.4374344

In contrast to the relationship between density and residual sugar, the relationships between fixed acidity and density in red wine are strong at all quality levels. The scatter points of red wine likewise show that fixed acidity tends to increase with increasing density. In white wine, on the other hand, the relationships are quite weak, and the scatter points show almost no relationship between fixed acidity and density.

Outliers Handling

Correlation Coefficient Information

## ===============================================
##  Grouped by Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.658 
##  Red     |    Medium          0.591 
##          |    High            0.617 
##  ----------------------------------------------
##          |    Low             0.082 
##  White   |    Medium          0.209 
##          |    High            0.415 
##  ==============================================
##  Grouped by Quality Level and Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.648 
##  Red     |    Medium          0.62 
##          |    High            0.656 
##  ----------------------------------------------
##          |    Low             0.085 
##  White   |    Medium          0.204 
##          |    High            0.349 
##  ===============================================

After outlier handling, the distributions of the scatter points of both red wine and white wine become much smaller. The distributions of red wine look somewhat oval, whereas those of white wine become quite round. The correlation coefficients tell us that removing the outliers makes the relation in red wine a little weaker, but makes almost no difference for white wine.


Density vs Citric Acid

Correlation Coefficient Information

##   color quality_level      corr
## 1   red           Low 0.4507759
## 2   red        Medium 0.3745896
## 3   red          High 0.5163765
## 4 white           Low 0.2339382
## 5 white        Medium 0.1021815
## 6 white          High 0.1284902

It is reasonable that this plot is similar to that of the relationship between density and fixed acidity, because citric acid is one of the major acids in fixed acidity (Fixed Acidity). However, fixed acidity also includes other acids such as tartaric, malic and succinic acid, so these relationships are weaker than those between density and fixed acidity.

Outliers Handling

Correlation Coefficient Information

## ===============================================
##  Grouped by Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.423 
##  Red     |    Medium          0.268 
##          |    High            0.347 
##  ----------------------------------------------
##          |    Low             0.042 
##  White   |    Medium          0.045 
##          |    High            0.078 
##  ==============================================
##  Grouped by Quality Level and Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.349 
##  Red     |    Medium          0.313 
##          |    High            0.382 
##  ----------------------------------------------
##          |    Low             0.17 
##  White   |    Medium          0.045 
##          |    High            0.081 
##  ===============================================

The results of outlier handling here are almost the same as the previous results (density vs fixed acidity).


Fixed Acidity vs Citric Acid

Correlation Coefficient Information

##   color quality_level      corr
## 1   red           Low 0.6267012
## 2   red        Medium 0.6743878
## 3   red          High 0.7452792
## 4 white           Low 0.3146377
## 5 white        Medium 0.2813673
## 6 white          High 0.2540356

This plot shows that citric acid tends to increase with increasing fixed acidity in both red wine and white wine. The difference is that in red wine the relationship gets stronger as the quality level rises: the correlation coefficient increases from the low to the high quality level. White wine shows the opposite pattern, with the correlation coefficient decreasing slightly as the quality level rises.

Outliers Handling

Correlation Coefficient Information

## ===============================================
##  Grouped by Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.615 
##  Red     |    Medium          0.619 
##          |    High            0.707 
##  ----------------------------------------------
##          |    Low             0.245 
##  White   |    Medium          0.29 
##          |    High            0.217 
##  ==============================================
##  Grouped by Quality Level and Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             0.55 
##  Red     |    Medium          0.648 
##          |    High            0.722 
##  ----------------------------------------------
##          |    Low             0.276 
##  White   |    Medium          0.299 
##          |    High            0.179 
##  ===============================================

Similarly, the results of outlier handling here are almost the same as the previous results. However, the distribution of the scatter points differs a little between the data grouped by wine’s color and the data grouped by quality level and wine’s color. We can see that the distribution is smaller when the data are grouped by wine’s color.
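The effect of the grouping on which points count as outliers can be illustrated with a small sketch. It assumes the 1.5 * IQR rule is applied within each group; the helper `is_outlier` and the data frame `toy_acid` are hypothetical and the values are invented, so this is not the report's actual code.

```r
# Hypothetical helper: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Invented citric acid values; row 5 (0.30) is unusual only within red/Low.
toy_acid <- data.frame(
  color         = rep(c("red", "white"), each = 10),
  quality_level = rep(rep(c("Low", "High"), each = 5), times = 2),
  citric.acid   = c(0.10, 0.11, 0.12, 0.13, 0.30,   # red, Low
                    0.50, 0.51, 0.52, 0.53, 0.54,   # red, High
                    0.30, 0.31, 0.32, 0.33, 0.34,   # white, Low
                    0.30, 0.31, 0.32, 0.33, 0.34)   # white, High
)

# ave() applies the helper within each group; the logical result is stored
# as 0/1 in a numeric vector, so compare with 1 to get TRUE/FALSE back.
toy_acid$out_by_color <- ave(toy_acid$citric.acid, toy_acid$color,
                             FUN = is_outlier) == 1
toy_acid$out_by_both  <- ave(toy_acid$citric.acid, toy_acid$color,
                             toy_acid$quality_level, FUN = is_outlier) == 1

print(toy_acid[toy_acid$out_by_color | toy_acid$out_by_both, ])
```

In this toy data no value is flagged when grouping by color alone, but the value 0.30 is flagged once the red wines are also split by quality level, because the quartiles of the smaller group are much tighter.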


Citric Acid vs Volatile Acidity

Correlation Coefficient Information

##   color quality_level       corr
## 1   red           Low -0.4932947
## 2   red        Medium -0.5720388
## 3   red          High -0.4947980
## 4 white           Low -0.1764287
## 5 white        Medium -0.1020559
## 6 white          High -0.2356505

According to Wikipedia, citric acid is eventually converted into acetic acid, which is the main compound in volatile acidity. Thus, as we can see in this plot, citric acid tends to decrease with increasing volatile acidity in red wine. However, there is no such clear tendency in white wine. In my opinion, this might indicate that the primary alcoholic fermentation is almost complete in red wine, given the tendency of yeast to convert citric acid into acetic acid.

Outliers Handling

Correlation Coefficient Information

## ===============================================
##  Grouped by Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             -0.557 
##  Red     |    Medium          -0.67 
##          |    High            -0.685 
##  ----------------------------------------------
##          |    Low             -0.158 
##  White   |    Medium          -0.095 
##          |    High            -0.15 
##  ==============================================
##  Grouped by Quality Level and Wine's Color
##  color   |   quality level     corr
##  ----------------------------------------------
##          |    Low             -0.53 
##  Red     |    Medium          -0.657 
##          |    High            -0.327 
##  ----------------------------------------------
##          |    Low             -0.115 
##  White   |    Medium          -0.079 
##          |    High            -0.234 
##  ===============================================

Outlier handling makes the distribution of the scatter points of white wine quite round and small, but the correlation coefficients do not change much, which implies that the outliers do not affect the relation much. On the other hand, the results for red wine differ considerably between the data grouped by wine’s color and the data grouped by quality level and wine’s color. When the data are grouped by wine’s color, the relation is strengthened at every quality level by removing the outliers. When the data are grouped by quality level and wine’s color, the change is quite different: the relation strengthens from the low to the medium quality level, and then weakens at the high quality level to about half of its value at the medium quality level.


Investigate the Quantity of Wine Parameters with Quality Level

In this section, the plots with outlier handling are used only for comparison with the plots without outlier handling. Basically, the box plots with outlier handling just make the tendencies clearer. There is not much change in any of the plots.

Quality vs Alcohol

##   color quality_level Min. 1st Qu. Median   Mean 3rd Qu. Max.
## 1   red           Low  8.4     9.4    9.7  9.926    10.3 14.9
## 2   red        Medium  8.4     9.8   10.5 10.630    11.3 14.0
## 3   red          High  9.2    10.8   11.6 11.520    12.2 14.0
## 4 white           Low  8.0     9.2    9.6  9.850    10.4 13.6
## 5 white        Medium  8.5     9.6   10.5 10.580    11.4 14.0
## 6 white          High  8.5    10.7   11.5 11.420    12.4 14.2

##   color quality_level Min. 1st Qu. Median   Mean 3rd Qu. Max.
## 1   red           Low  9.0     9.4   9.60  9.772   10.00 11.5
## 2   red        Medium  8.7     9.8  10.50 10.560   11.20 13.3
## 3   red          High  9.5    10.9  11.30 11.420   11.92 13.6
## 4 white           Low  8.0     9.2   9.55  9.793   10.30 12.6
## 5 white        Medium  8.5     9.7  10.50 10.620   11.40 14.0
## 6 white          High  8.7    11.0  11.75 11.700   12.50 14.2

The overall tendency in both red wine and white wine is that quality increases with increasing alcohol content. If we focus on the low quality level, there is no relationship in red wine, and quality increases with decreasing alcohol content in white wine. On the other hand, quality increases with increasing alcohol content in both red wine and white wine at the medium and high quality levels. Besides, red wine and white wine have similar alcohol content.
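Summary tables like the ones in this section can be built by computing the usual six summary statistics within each color and quality level. The sketch below uses base R only; the data frame `toy_alcohol`, its values and the helper `six_numbers` are hypothetical and serve only to show the mechanics.

```r
# Invented alcohol values for two colors and two quality levels.
toy_alcohol <- data.frame(
  color         = rep(c("red", "white"), each = 6),
  quality_level = rep(rep(c("Low", "High"), each = 3), times = 2),
  alcohol       = c(9.4, 9.7, 10.3,  10.8, 11.6, 12.2,
                    9.2, 9.6, 10.4,  10.7, 11.5, 12.4)
)

# Hypothetical helper returning the same six statistics as summary().
six_numbers <- function(x) {
  c(Min    = min(x),
    Q1     = quantile(x, 0.25, names = FALSE),
    Median = median(x),
    Mean   = mean(x),
    Q3     = quantile(x, 0.75, names = FALSE),
    Max    = max(x))
}

# One row per color/quality-level combination, one column per statistic.
alcohol_stats <- t(sapply(
  split(toy_alcohol$alcohol,
        list(toy_alcohol$color, toy_alcohol$quality_level)),
  six_numbers
))
print(round(alcohol_stats, 3))
```

The row names take the form `color.quality_level`, for example `red.Low`, so a single group can be looked up directly.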


Quality vs Residual Sugar

##   color quality_level Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 1   red           Low  1.2     1.9  2.200 2.542    2.60 15.50
## 2   red        Medium  0.9     1.9  2.200 2.477    2.50 15.40
## 3   red          High  1.2     2.0  2.300 2.709    2.70  8.90
## 4 white           Low  0.6     1.7  6.625 7.054   11.02 23.50
## 5 white        Medium  0.7     1.7  5.300 6.442    9.90 65.80
## 6 white          High  0.8     1.8  3.875 5.262    7.40 19.25

##   color quality_level Min. 1st Qu. Median  Mean 3rd Qu. Max.
## 1   red           Low  1.2     1.9    2.1 2.185     2.5  3.5
## 2   red        Medium  1.2     1.9    2.1 2.132     2.4  3.0
## 3   red          High  1.4     1.8    2.1 2.184     2.5  3.6
## 4 white           Low  0.6     1.8    7.2 7.433    11.7 23.5
## 5 white        Medium  0.7     1.7    5.2 6.257     9.6 20.8
## 6 white          High  0.8     1.7    2.9 4.071     5.8 14.8

It is obvious that white wine has much more residual sugar than red wine.


Quality vs Total Acidity

##   color quality_level  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 1   red           Low 5.120   7.705   8.39 8.732   9.480 16.26
## 2   red        Medium 5.300   7.605   8.40 8.845   9.881 14.61
## 3   red          High 5.320   7.780   9.04 9.253  10.490 16.28
## 4 white           Low 4.415   6.678   7.16 7.272   7.760 12.03
## 5 white        Medium 4.110   6.550   7.03 7.098   7.568 14.47
## 6 white          High 4.125   6.505   6.98 6.990   7.482  9.45

##   color quality_level  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 1   red           Low 5.520   7.722  8.340 8.533   9.388 11.67
## 2   red        Medium 5.980   7.640  8.285 8.620   9.390 12.68
## 3   red          High 6.500   8.210  9.305 9.377  10.560 12.39
## 4 white           Low 4.415   6.670  7.100 7.182   7.680 10.47
## 5 white        Medium 5.090   6.520  7.000 7.039   7.520  9.04
## 6 white          High 5.230   6.490  6.910 6.933   7.350  8.80

For this plot, I created one variable, total acidity, which is calculated as the sum of fixed acidity and volatile acidity. I did not add citric acid because I think that citric acid is already one of the compounds in fixed acidity. As we can see, it is obvious that red wine has more acidity than white wine. The interesting thing is that the acidity of white wine at the high quality level is concentrated in a narrow range below about 7.5 g/dm³ (the interquartile range in the tables above is roughly 6.5 to 7.5 g/dm³).
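The derived variable is a plain column sum. As a small check, the sketch below rebuilds it for the first four rows shown in the data summary at the start of this report; the data frame name `wine_head` is made up for the example.

```r
# First four rows of fixed and volatile acidity from the data summary.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28)
)

# Total acidity is the sum of fixed and volatile acidity
# (citric acid is not added separately).
wine_head$total.acidity <- wine_head$fixed.acidity + wine_head$volatile.acidity
print(wine_head)
```

The resulting values (8.1, 8.68, 8.56 and 11.48) agree with the `total.acidity` column in the data summary.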


Quality vs Total Sulfur Dioxide

##   color quality_level Min. 1st Qu. Median   Mean 3rd Qu. Max.
## 1   red           Low    6   23.75     45  54.65      78  155
## 2   red        Medium    6   23.00     35  40.87      54  165
## 3   red          High    7   17.00     27  34.89      43  289
## 4 white           Low    9  117.00    149 148.60     182  440
## 5 white        Medium   18  107.20    132 137.00     164  294
## 6 white          High   34  101.00    122 125.20     146  229

##   color quality_level Min. 1st Qu. Median   Mean 3rd Qu. Max.
## 1   red           Low    7   26.00   44.5  53.95      73  147
## 2   red        Medium    6   23.25   34.0  38.11      49   94
## 3   red          High    7   15.75   24.0  26.14      35   56
## 4 white           Low   19  117.00  149.0 149.10     182  260
## 5 white        Medium   24  107.00  130.0 135.50     162  248
## 6 white          High   53   98.00  116.0 118.00     136  203

According to the wine quality information document, sulfur dioxide prevents microbial growth and the oxidation of wine. As we can see, much more sulfur dioxide is used in white wine to preserve its quality. In particular, the quantity of sulfur dioxide used at the high quality level might be controlled within the range of about 116 to 121 mg/dm³.

Quick Review of Quality vs Features Grouped by Quality Level

The box plots show that white wine has higher residual sugar and total sulfur dioxide. In particular, these quantities have a relatively small range at the high quality level, which in my opinion implies that they are controlled. On the other hand, red wine has higher acidity. The interesting thing is that red wine with higher fixed acidity and lower volatile acidity has better quality.


Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Taking the outlier handling into account, the conclusions are as follows.

Red wine with higher alcohol content, higher fixed acidity (citric acid) and lower volatile acidity seems to have better quality.

White wine with higher alcohol content and lower residual sugar seems to have better quality at the low and medium quality levels, but the relation seems weak or even absent at the high quality level. Besides, white wine with total sulfur dioxide controlled at about 120 mg/dm³ seems to have better quality.

Were there any interesting or surprising interactions between features?

From the analysis of the relationships among alcohol, density, residual sugar and acidity, it seems that the fermentation of red wine is almost complete, while the fermentation of white wine is controlled at some level, especially for wines at the low and medium quality levels. When I looked at the relationship between alcohol content and residual sugar, there was no apparent relationship in red wine and a strong negative relationship in white wine. However, in the results with outlier handling, I found a weak or no relation between alcohol and residual sugar for white wine at the high quality level, which is quite similar to red wine. It might be that the fermentation is almost complete for white wine at the high quality level. Since alcohol is generated by the fermentation of sugar, this is consistent with the box plot, which shows that the alcohol content of white wine is highest at the high quality level. Additionally, the relationship between citric acid and fixed acidity and the relationship between citric acid and volatile acidity point to a similar conclusion.



Final Plots and Summary

Plot One: Primary Feature That Affects Wine Quality in Red Wine and White Wine

Description One

Alcohol content has the strongest relationship with quality of any feature in both red wine and white wine. The overall tendency in both red wine and white wine is that quality increases with increasing alcohol content. However, if we focus on the low quality level, the tendency is not clear in red wine, and in white wine the alcohol content tends to decrease with increasing quality. On the other hand, at the medium and high quality levels, the higher the alcohol content, the higher the quality for both red wine and white wine. Besides, after outlier handling, the higher the quality, the smaller the range of alcohol content for both red wine and white wine. However, there is no such phenomenon in red wine at the low quality level.


Plot Two: Secondary Feature That Affects Wine Quality in Red Wine

Description Two

By combining a scatter plot with density plots of the x- and y-axis variables, it is easier to see the tendency from the low to the high quality level. For the high quality level, both total acidity and alcohol content have a wide range. For the medium quality level, the alcohol content also has quite a wide range, but total acidity does not. For the low quality level, both alcohol content and total acidity are concentrated in the low range. It might be hard to distinguish red wine at the high level from red wine at the medium level, but red wine at the low level has lower total acidity and alcohol content. Thus, if we focus on the scatter plot, the data of the high quality level occupy quite a wide range, while the data of the low quality level are concentrated at the bottom. In my opinion, you probably have a high-quality red wine if it tastes a little sour and gets you drunk.


Plot Three: Secondary Feature That Affects Wine Quality in White Wine

Description Three

Actually, the scatter plot is a little hard to read, but the density plots give some insight into the data. The data of the high quality level are mostly located at the upper left, the data of the medium quality level at the middle left, and the data of the low quality level at the bottom left. According to the density plots, the high quality level has a low and narrow range of residual sugar but a high and wide range of alcohol content, which means that alcohol content stays almost the same as residual sugar changes. The medium quality level has a wide range of both residual sugar and alcohol content, and alcohol changes a little with changing residual sugar. The low quality level has a wide and low range of residual sugar but a narrow and low range of alcohol, and the change of alcohol relative to residual sugar is similar to that of the medium quality level. Recalling the plots of alcohol vs residual sugar in the Multivariate section, this plot carries similar information: the fermentation at the high quality level might be almost complete, so that there is no relation between alcohol and residual sugar, while the fermentation at the medium and low quality levels might still be in progress. This strengthens my thought that the progress of fermentation might affect the quality of white wine. Personally, I like sweeter wine, but after this analysis I will try a drier and stronger white wine next time.



Reflection

I expected red wine and white wine to be quite different, so I decided to explore the data of both at the same time. At the beginning of this project, I was struggling with how to make the project meaningful. I decided to look at the individual histograms of the features to get a feel for each one. Although the shapes of the distributions of each feature are quite similar between red wine and white wine, I still found that the quantities and tendencies are quite different, as I had expected. Thus, I was not very surprised by the differences between red wine and white wine in the resulting distributions.

In the bivariate plots section and analysis, I was struggling with what kind of data exploration I should perform. The results of the ggcorr function helped me a lot, because its plot makes it convenient to observe the relationships between the features. I found that red wine and white wine differ considerably in the relationships between the features, especially for residual sugar and acidity. I picked several relationships that I was interested in and checked their scatter-point distributions. As a result, I found that residual sugar has a high influence on white wine, and acidity has a high influence on red wine.

At the beginning of this report, I created a variable, quality_level, which classifies the quality into three levels: low, medium and high. This variable was used in the Multivariate Plots Section to check the effect of each feature on the quality. As I expected, the tendencies are different at different quality levels. That strengthened my confidence that residual sugar has a high influence on white wine, and acidity has a high influence on red wine.
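A three-level factor of this kind can be derived from the integer quality score with `cut()`. The sketch below uses the first ten quality scores from the data summary. In those rows a score of 5 is Low, 6 is Medium and 7 is High; the treatment of scores below 5 and above 7 is an assumption made here for illustration, not something the summary confirms.

```r
# First ten quality scores from the data summary.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed cut points: quality <= 5 is Low, 6 is Medium, >= 7 is High.
quality_level <- cut(quality,
                     breaks = c(-Inf, 5, 6, Inf),
                     labels = c("Low", "Medium", "High"))

print(table(quality, quality_level))
```

With these cut points the factor codes for the ten rows are 1 1 1 2 1 1 1 3 3 1, the same as in the data summary.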

I did not investigate all the relationships between the features. However, a low correlation coefficient does not mean that the corresponding features have no influence on the quality. For example, sulphates are an additive that contributes to sulfur dioxide gas, so they should have some influence on white wine. Besides, I found that there are lots of outliers in different features, so I cleaned the data to make the results more reliable. Additionally, I would like to know the year or vintage of each individual wine, because the vintage has a great influence on a wine’s quality. This kind of analysis of red and white wine would be interesting future work.

Following the suggestion of the previous reviewer, I added the analysis of outlier handling. At the beginning, I was frustrated about how to perform this analysis. From the first project of this nanodegree, I knew that the IQR rule (flagging values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR) is one way to detect outliers. While studying outlier handling, I looked up a lot of information. There are many other ways to detect outliers, such as deviation-based rules, regression analysis and Cook’s distance. Besides, I noticed that new outliers appear when I replace the old outliers with NA, because removing values changes the sample and therefore the quartiles. In order to eliminate the outliers, I iterated the outlier handling. Actually, I am not sure whether this is a good idea, because I do not have much experience with it. Also, according to the information I found, some people replace outliers with a capping value, the mean or the median.

The results of outlier handling are strongly affected by how I group the data during the process. The difference might be hard to observe when the groups are large, as when grouping by color, but it becomes visible when the groups get smaller, as when grouping by quality level or even by quality. I also found the R packages “outliers” and “mvoutlier”, but I did not use them because this was my first time doing outlier handling. Although it took a lot of time to code the outlier handling, I learned very much. Of course, I did not refactor my code, and I found some problems, for example that the outliers are not always removed cleanly. I think an analysis that focuses on outlier handling, including refactoring the code, would be truly interesting future work.
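The iteration described above can be sketched as follows. This is a minimal illustration under the assumption that the 1.5 * IQR rule is reapplied until no value is flagged; the function names `flag_outliers` and `remove_outliers_iteratively`, the `max_iter` safeguard and the sample values are all hypothetical, and this is not the code used for this report.

```r
# Hypothetical helper: TRUE for non-missing values outside the 1.5*IQR fences.
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Replace outliers with NA and repeat until a pass finds none.
# max_iter is a safeguard against an unexpectedly long loop.
remove_outliers_iteratively <- function(x, max_iter = 50) {
  passes <- 0
  for (i in seq_len(max_iter)) {
    out <- flag_outliers(x)
    if (!any(out)) break
    x[out] <- NA
    passes <- passes + 1
  }
  list(values = x, passes = passes)
}

# Invented sample: 50 is flagged first; 20 is flagged only on the second pass.
x <- c(10, 11, 12, 13, 14, 20, 50)
result <- remove_outliers_iteratively(x)
print(result$values)
print(result$passes)
```

In this sample the first pass removes only 50. Once it is gone the quartiles tighten and 20 falls outside the new fences, which is the same effect of new outliers appearing after the old ones are replaced with NA.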

It was my first time using R for an analysis. I spent a lot of time learning R skills, especially plotting. In my opinion, R is truly convenient for data exploration and analysis, and it has lots of powerful functions that improve efficiency. Besides, in order to make the explanations meaningful, I looked up several pieces of information about wine, especially about sulfur dioxide, residual sugar and acidity (fixed acidity and volatile acidity).